Gaussian Mixture Models (GMMs) are a kind of hybrid between a clustering estimator and a density estimator. A density estimator is an algorithm that takes a D-dimensional dataset and produces an estimate of the D-dimensional probability distribution from which that data is drawn. The GMM algorithm accomplishes this by representing the density as a weighted sum of Gaussian distributions.

Kernel density estimation (KDE) in some senses takes the mixture-of-Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density.
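
Concretely, for N points x_1, ..., x_N, the kernel density estimate at a point x is a normalized sum of kernels K (Gaussians, in the case above) with bandwidth h, each centered on one of the data points:

$$\hat{f}(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)$$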


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

Motivation for KDE - Histograms

A density estimator is an algorithm that seeks to model the probability distribution that generated a dataset. For one-dimensional data, the simplest and most familiar density estimator is the histogram. A histogram divides the data into discrete bins, counts the number of points that fall in each bin, and then visualizes the results in an intuitive manner.


In [2]:
def make_data(N, f=0.3, rseed=1087):
    rand = np.random.RandomState(rseed)
    x = rand.randn(N)
    x[int(f*N):] += 5
    return x

x = make_data(1000)

A standard count-based histogram can be created with the plt.hist() function. The normed parameter of this function makes the heights of the bars reflect probability density:


In [3]:
hist = plt.hist(x, bins=30, normed=True)


Because this histogram has equal-width bins, the normalization simply changes the scale on the y-axis, leaving the shape of the histogram constant. The normed parameter ensures that the total area under the histogram is 1, as we can confirm below:


In [4]:
density, bins, patches = hist
widths = bins[1:] - bins[:-1]
(density * widths).sum()


Out[4]:
1.0

One problem with using a histogram as a density estimator is that the choice of bin size and location can lead to representations that have qualitatively different features.

Consider this example of 20 points: the choice of bins can lead to an entirely different interpretation of the data.


In [6]:
x = make_data(20)
bins = np.linspace(-5, 10, 10)

fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True,
                      subplot_kw = {'xlim': (-4, 9), 'ylim': (-0.02, 0.3)})

fig.subplots_adjust(wspace=0.05)
for i, offset in enumerate([0.0, 0.6]):
    ax[i].hist(x, bins=bins+offset, normed=True)
    ax[i].plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)


We can think of a histogram as a stack of blocks, where we stack one block within each bin on top of each point in the dataset. Let's view this in the following chart:


In [7]:
fig, ax = plt.subplots()
bins = np.arange(-3, 8)
ax.plot(x, np.full_like(x, -0.1), '|k',
        markeredgewidth=1)
for count, edge in zip(*np.histogram(x, bins)):
    for i in range(count):
        ax.add_patch(plt.Rectangle((edge, i), 1, 1,
                                   alpha=0.5))
ax.set_xlim(-4, 8)
ax.set_ylim(-0.2, 8)


Out[7]:
(-0.2, 8)

The difference between the two binnings comes from the fact that the height of the block stack often reflects not the actual density of points nearby, but coincidences of how the bins align with the data points. This misalignment between points and their blocks is a potential cause of the poor histogram results.

What if, instead of stacking the blocks aligned with the bins, we were to stack the blocks aligned with the points they represent? If we do this, the blocks won't be aligned with each other, but we can add their contributions at each location along the x-axis to find the result.


In [12]:
x_d = np.linspace(-4, 8, 2000)
density = sum((abs(xi - x_d) < 0.5) for xi in x)

plt.fill_between(x_d, density, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.2, 8]);



These rough edges are not aesthetically pleasing, nor do they reflect any true properties of the data. In order to smooth them out, we might decide to replace the block at each location with a smooth function, like a Gaussian. Let's use a standard normal curve at each point instead of a block:


In [19]:
from scipy.stats import norm
x_d = np.linspace(-4, 8, 1000)
density = sum(norm(xi).pdf(x_d) for xi in x)

plt.fill_between(x_d, density, alpha=0.5)
plt.plot(x, np.full_like(x, -0.1), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.2, 5]);


This smoothed-out plot, with a Gaussian distribution contributed at the location of each input point, gives a much more accurate idea of the shape of the data distribution, and one which has much less variance (i.e., changes much less in response to differences in sampling).
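
One detail to keep in mind: the curve above is the raw sum of the 20 normal curves, so the area under it equals the number of points rather than 1. Dividing the sum by len(x) turns it into a properly normalized density estimate; a minimal sketch, reusing the x and x_d arrays defined above:


In [ ]:
# divide the sum of Gaussians by the number of points so the curve integrates to 1
density_norm = sum(norm(xi).pdf(x_d) for xi in x) / len(x)

plt.fill_between(x_d, density_norm, alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.axis([-4, 8, -0.05, 0.3]);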

Kernel Density Estimation in Practice

The free parameters of kernel density estimation are the kernel, which specifies the shape of the distribution placed at each point, and the kernel bandwidth, which controls the size of the kernel at each point. Scikit-Learn's KernelDensity estimator offers a choice of six kernels.


In [23]:
from sklearn.neighbors import KernelDensity

# instantiate and fit the KDE model
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde.fit(x[:, None])

# score samples returns the log of the probability density
logprob = kde.score_samples(x_d[:, None])

plt.fill_between(x_d, np.exp(logprob), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.ylim(-0.02, 0.30);
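

The model above uses the Gaussian kernel; any of the other kernel names accepted by KernelDensity (for example 'tophat', which roughly corresponds to the block-stacking approach we used earlier) can be swapped in to see how the kernel shape affects the estimate. A quick sketch, reusing x, x_d, and logprob from above:


In [ ]:
# refit the same data with a tophat kernel and compare it to the Gaussian fit
kde_tophat = KernelDensity(bandwidth=1.0, kernel='tophat')
kde_tophat.fit(x[:, None])
logprob_tophat = kde_tophat.score_samples(x_d[:, None])

plt.fill_between(x_d, np.exp(logprob), alpha=0.5, label='gaussian')
plt.fill_between(x_d, np.exp(logprob_tophat), alpha=0.5, label='tophat')
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.legend()
plt.ylim(-0.02, 0.35);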


Selecting the bandwidth via cross-validation

The choice of bandwidth within KDE is extremely important for finding a suitable density estimate, and it is the knob that controls the bias-variance trade-off in the estimate of density: too narrow a bandwidth leads to a high-variance estimate (over-fitting), where the presence or absence of a single point makes a large difference; too wide a bandwidth leads to a high-bias estimate (under-fitting), where the structure in the data is washed out by the wide kernel.

In machine learning contexts, we've seen that such hyperparameter tuning is often done empirically via a cross-validation approach.


In [28]:
KernelDensity().get_params().keys()


Out[28]:
dict_keys(['algorithm', 'atol', 'bandwidth', 'breadth_first', 'kernel', 'leaf_size', 'metric', 'metric_params', 'rtol'])

In [29]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import LeaveOneOut

bandwidths = 10 ** np.linspace(-1, 1, 100)
grid = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': bandwidths}, cv=LeaveOneOut(len(x)))

grid.fit(x[:, None])


Out[29]:
GridSearchCV(cv=sklearn.cross_validation.LeaveOneOut(n=20),
       error_score='raise',
       estimator=KernelDensity(algorithm='auto', atol=0, bandwidth=1.0, breadth_first=True,
       kernel='gaussian', leaf_size=40, metric='euclidean',
       metric_params=None, rtol=0),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'bandwidth': array([  0.1    ,   0.10476, ...,   9.54548,  10.     ])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [30]:
grid.best_params_


Out[30]:
{'bandwidth': 0.6428073117284322}
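
The cross-validation procedure selects a bandwidth of about 0.64 for this dataset. As a final step, we can use the refit estimator from the grid search to plot the resulting density estimate; a short sketch, reusing the x and x_d arrays from above:


In [ ]:
# GridSearchCV refits the best model on the full data (refit=True),
# so we can use it directly to evaluate the density
kde_best = grid.best_estimator_
logprob_best = kde_best.score_samples(x_d[:, None])

plt.fill_between(x_d, np.exp(logprob_best), alpha=0.5)
plt.plot(x, np.full_like(x, -0.01), '|k', markeredgewidth=1)
plt.ylim(-0.02, 0.40);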

In [ ]: